Abstract

Until recently, cyber-physical systems, especially those 
with safety-critical properties that manage critical infrastructure
(e.g. power generation plants, water treatment facilities, etc.),
were considered to be invulnerable
against software security breaches. The recently discovered 
'W32.Stuxnet' worm has drastically changed this percep- 
tion by demonstrating that such systems are susceptible to 
external attacks. Here we present an architecture that en- 
hances the security of safety-critical cyber-physical systems 
despite the presence of such malware. Our architecture uses 
the property that control systems have deterministic execution
behavior to detect an intrusion within 0.6 μs while still
guaranteeing the safety of the plant. We also show that even 
if an attack is successful, the overall state of the physical 
system will still remain safe. Even if the operating system's 
administrative privileges have been compromised, our ar- 
chitecture will still be able to protect the physical system 
from coming to harm. 

1. Introduction 

Many systems that have safety-critical requirements 
such as power plants, industry automation systems, automo- 
biles, etc. can be classified as cyber-physical systems (CPS) 
- i.e. a tight combination of, and co-ordination between, 
computational and physical components. Typically the 'cy- 
ber' side aids in the monitoring or control of the physical 
side. These systems (or parts of them) have stringent safety 
requirements and require deterministic operational guaran- 
tees since a failure to meet either could result in physical 
harm to the system, the environment or even humans. 

Such systems have traditionally been considered to be 
extremely secure since they (a) are typically not connected
to the Internet; (b) use specialized protocols and proprietary
interfaces ('security through obscurity'); (c) are physically
inaccessible to the outside world and (d) typically have their
control code executing on custom hardware such as special- 
ized processors or programmable logic controllers (PLCs). 

This misconception of ironclad security in such systems,
however, was recently exposed when the 'W32.Stuxnet'
worm (henceforth referred to as just 'Stuxnet') targeted 
and successfully infiltrated a Siemens WinCC/PCS7 control 
system [29-31]. Not only was it able to bypass all the secu- 
rity (digital as well as physical) techniques but it also repro- 
grammed the PLC that controlled the main system. It modi- 
fied the operating frequencies sent from the PLC thus caus- 
ing physical damage to the system [28]. 

In this paper, we specifically address the problem of se- 
curity for physical control systems. Compared to general- 
purpose techniques, our work is different in that we focus 
on domain-specific characteristics of these cyber-physical
systems and in particular, their deterministic real-time na- 
ture. We introduce an overall system architecture where an 
isolated and trusted hardware component is leveraged to en- 
hance the security of the complete system. We present a 
novel intrusion detection mechanism that monitors context- 
specific side channels on the main CPU - in this particular 
paper we use the deterministic execution time of the control
system as the main side channel for this purpose.¹

Program execution inherently includes variance due to 
a variety of features, viz. complex control flow (branches, 
unbounded loops, etc.), hardware features (caches, branch 
predictors, bus contention, etc.), system effects (OS noise,
compiler optimizations, network traffic, interrupts, etc.) and
so on. Attackers often use these characteristics and other 
vulnerabilities in the system to their advantage. Existing 
mechanisms [8, 13] work well in detecting and prevent- 
ing problems, but either require custom configuration of 
reconfigurable hardware for each type of checking mech- 
anism [8] or enforce run-time monitoring and constraints 
on access to data by fine-grained checks on what instruc- 
tions/programs are allowed to access [13]. Either way, there 
is a distinct need to know more details about the program 
and data semantics. Typically good security involves one
or more of the following principles: (i) knowledge/use of
control semantics; (ii) details about program and data
semantics and checking mechanisms; (iii) hardware-enabled
trust/protection; (iv) externally monitorable information
(e.g. real-time execution time profile in our case) and (v)
robustness/fault-tolerance mechanisms.

¹ We elaborate on other potential side-channels in Sections 5.3 and 9.

Hence, we present the Secure System Simplex Architec- 
ture (S3A) to improve the security of cyber-physical sys- 
tems that uses (i), (iii), (iv) and (v) from above as follows: a 
combination of knowledge of high-level control flow, a secure
co-processor implemented on an FPGA, deterministic
execution time profiles and System Simplex [2,26]. S3A
detects intrusions that modify execution times by as little
as 0.6 μs on our test control system. With S3A, we expand
the definition of 'correct system state' to include not
just the physical state of the plant but also the cyber state, 
i.e. the state of the computer/PLC that executes the con- 
troller code. This type of security is hard for an attacker 
to overcome by reverse engineering the code or the sys- 
tem especially since it involves absolutely no changes to 
the source code/binary. Even if an infection occurs and all 
of the security mechanisms are side-stepped (such as gain- 
ing access to the administrative privileges or the replication 
of our benevolent side channels), the trusted hardware com- 
ponent (secure co-processor) and the robust Simplex mech- 
anism will still prevent the physical system from coming to 
harm, even from threats such as Stuxnet. 

It is important to note that S3A is a system-level solution 
that can integrate multiple different solutions to achieve security
and safety in this domain. While we picked some
mechanisms (execution time, Simplex, etc.), other good
concepts and architectures [8, 13] can also be integrated to 
make the system that much more secure and robust. 

The main contributions of this paper are as follows. 
We present the Secure System Simplex Architecture (S3A) 
where, 

1. a trusted hardware component provides oversight over 
an untrusted real-time embedded control platform. 
This design provides a guarantee of plant safety in the 
event of successful infections. Even if an attacker gains 
administrative/root privileges she cannot inflict much 
harm since S3A ensures that the overall system (especially
the physical plant) will not be damaged.

2. we investigate and use context-dependent side channels
for intrusion detection. These side channels, monitored
by the trusted hardware component, qualitatively
increase the difficulty faced by potential attackers.
Typically side channels are used
to break security techniques but we use them to our
advantage in S3A. In this paper, we focus on side
channels in the context of CPU-controlled real-time
embedded control systems.

3. we build and evaluate an S3A prototype for an inverted
pendulum plant and discuss the implementation effort
and the construction of a side channel detection
mechanism for execution time-based side channels using
an FPGA in the role of the trusted hardware component.
The side channel approach is shown to detect
intrusions significantly faster than earlier plant-state- 
only detection approaches. 

While intrusion detection is a broad area in computer se- 
curity, our approach takes advantage of the real-time prop- 
erties specific to embedded control systems. Also, most of 
the existing side-channel techniques/information (timing, 
memory, etc.) have traditionally been used to break the security
of systems. This paper proposes a method to turn
this around so that these pieces of information are now used
to increase the security of the system. Also, such techniques
have not been used before from the perspective of
safety-critical control systems - hence we believe that this
paper's contributions are novel.

We also believe that our approach is generalizable to
PLC and microcontroller-based CPS. Our justification is
twofold: such systems (i) have stringent requirements for
correct operation, i.e. the physical state of the plant must be
kept safe under all conditions and (ii) often require the controller
process to run in a deterministic manner.

Assumptions: The important assumptions for the work 
presented in this paper are: 

• the system consists of a set of periodic, real-time tasks 
with stringent timing and deadline constraints man- 
aged by a real-time scheduler; such systems typically 
do not exhibit complex control flow, do not use dynam- 
ically allocated data structures, do not contain loops 
with unknown upper bounds, don't use function point- 
ers, etc. - in fact, they are often designed/developed 
with simplicity and determinism in mind 

• the hardware component must be trusted and can only 
be accessed by authorized personnel/engineers - this is 
not unlike the RSA encryption mechanism where the 
person holding the private key must be trusted 

• the systems we describe are rarely updated and definitely
not in a remote fashion (unlike, say, mobile embedded
devices) - see Section 4 for details.

Note: Our techniques are not specific to the attacks mentioned
in Section 2; they tackle the broader class of security
breaches of controllers in safety-critical CPS.

This paper is organized as follows: Section 2 reviews 
breaches in safety-critical systems while Section 3 dis- 
cusses the attack models that affect our work. Section 4 
provides a background of System-Level Simplex. Section 
5 presents the Secure System Simplex Architecture and implementation
details. Section 6 discusses the evaluation of
the system. Section 7 discusses the limitations of our ap- 
proach. Finally, related work is reviewed in Section 8 and 
Section 9 concludes the paper and also presents some ideas 
for future work. 

2. Motivation 

Many control systems attached to critical infrastructure 
systems have traditionally been assumed to be extremely 
secure. The chief concern with such systems is related 
to safety, i.e. to ensure that the plant's operations remain 
within a predefined safety envelope. "Security" was attained 
by restricting external access to such systems - they were 
typically not connected to the Internet and only a few peo- 
ple were granted access to the computers that controlled the 
systems. Also, since parts (or even all) of the control code 
executed on dedicated hardware (PLCs for instance), they 
were considered to be secure as well. 

2.1. Stuxnet

The W32.Stuxnet worm attack [29-31] overturned all of
the above assumptions. It showed that industrial control sys- 
tems could now be targeted by malicious code and that not 
even hardware-based controllers were safe. Stuxnet employed
a sophisticated attack mechanism that took
control of the industrial automation system executing on a
PLC and operated it according
to the attacker's design. It was also able to hide these
changes from the designers/engineers who operate the sys- 
tem. To achieve these results, Stuxnet utilized a large num- 
ber of complex methods the most notable of which was the 
first known PLC rootkit [9]. In fact, Stuxnet was present on 
the infected systems for a long time before it was detected 
- perhaps even a few months. Also, from all the informa- 
tion that is available about the original attack, it seems that 
the worm made its way to the original system through an infected
USB stick. In this section we will focus on the real
target of Stuxnet - the control code that manages the plants 
and the implications of such an attack. 

Stuxnet had the ability to (a) monitor blocks that were
exchanged between the PLC and computer, (b) infect the
PLC by replacing legitimate blocks with infected ones and 
(c) hide the infection from designers. This results in the op- 
erators being unaware of the infection, since the information 
that they receive (supposedly from the PLC) shows every- 
thing to be operating correctly. The PLCs are used to com- 
municate with and control 'frequency converter drives' that 
manage the frequency of a variety of motors. The malicious
code in the infected PLC affects the operational frequency 
of these motors so that they now operate outside their safety 
ranges. E.g., in one instance, the frequency of a motor was 
set to 1410 Hz, then 2 Hz and then to 1604 Hz and the entire
sequence is repeated multiple times - the normal operating
frequency for this motor is between 807 Hz and 1210
Hz [28]. Hence, in this instance, Stuxnet's actions can re- 
sult in real physical harm to the system. 

Note: In this work, our focus is not on preventing the 
original intrusion or providing mechanisms to safeguard the 
Windows machines that were originally infected. We intend 
to detect the infection of the control code (on a PLC in this 
example, but could be any computer that runs it) and mainly 
safeguard the physical system from coming to harm. 

2.2. Automotive Attack Surfaces

Researchers from the University of Washington demon- 
strated how a modern automobile's safety can be compro- 
mised by malicious attackers [6, 15]. They show how an at- 
tacker is able to circumvent the rudimentary security pro- 
tections in modern automobiles and infiltrate virtually any 
electronic control unit (ECU) in the vehicle and compro- 
mise safety-critical systems such as disabling the brakes, 
stopping the engine, selectively braking individual wheels 
on demand, etc. - all of this, while ignoring the driver's in- 
puts/actions. They were able to achieve this due to the vul- 
nerabilities in the CAN bus protocols used in many modern 
vehicles. The attackers also show how malicious code can 
be embedded within the car's telematics unit that will com- 
pletely erase itself after causing the crash. 

This example is important, since the authors showed that 
the safety-critical components of the vehicle can easily be 
targeted, thus putting at risk the humans in the car. One im- 
portant facet to note is that the critical components that were 
attacked, such as engine control unit, braking unit, etc. all 
have stringent real-time properties. Hence, our techniques 
will work well in detecting the intrusions in these safety- 
critical subsystems. 

2.3. Maroochy Wastewater Attack and Other Examples

In 2001, a former employee of a small town in Australia
started issuing radio commands (using stolen equipment)
to sewage treatment facilities that he had helped
install [1,27]. This resulted in a lot of
environmental damage. The attack was hard to track since 
the requisite alarms were not being reported to the central 
computer and this computer couldn't communicate with the 
pumping stations during the attacks. Initially the incidents 
looked like anomalous, unintentional events. It took a
lot of analysis of the system to understand that there was a 
malicious entity operating to cause these problems. 

There have been numerous other attacks that infiltrated 
critical systems e.g. NRG generation plants [20], canal sys- 
tems [18], medical devices [16], etc. 

2.4. Discussion

As these examples show, safety-critical systems can no 
longer be considered to be safe from security breaches. 
While the development of cyber security techniques can 
help alleviate such problems, the real concern is for the control
systems and physical plants that can be seriously damaged
- often resulting in the crippling of critical infrastructure.
Hence, we propose non-traditional intrusion detection
and recovery mechanisms to tackle such problems. We use 
to our advantage the fact that the control codes running in a 
real-time system tend to be deterministic in behavior, sim- 
ple to implement and exhibit strict timing properties. 

For the rest of this paper, we will show how such in- 
trusions can be detected and the harmful effects mitigated 
by use of our Secure System Simplex Architecture (S3A). 
Hence, our aim is to identify, as quickly as possible, that 
an infection has taken place and then ensure that the system
(and its physical components) is always safe. Note: as
stated in the introduction, our work does not aim to prevent 
the original infections since that is a large problem that re- 
quires the development and implementation of multiple lev- 
els of cyber security techniques/research. We focus on the 
aftermath of the infection of control codes. 

3. Threat Model 

We deliberately will not delve too deeply into specific
threat models, since we believe that our techniques will 
work well for a broad class of attacks that modify the ex- 
ecution behavior of embedded code in safety-critical sys- 
tems. Attacks similar to the ones mentioned in Section 2 
can be caught by the mechanisms presented in this paper. 
Hence, code could be injected by any of the mechanisms 
described in that section - as long as the malicious en- 
tity tries to execute any code, we will be able to detect it. 
Hence, our threat model [14] is quite broad and can detect 
attacks such as: (a) physical attacks, i.e. code injected via 
infected/malicious hardware; (b) memory attacks where at- 
tackers try to inject malicious code into the system and/or 
take over existing code; (c) insider attacks where the attack- 
ers try to gain control of the application/system by altering 
all or part of the program at runtime. 

We will, instead, focus on what happens after attackers 
perform any of the above actions in order to execute their 
code. Hence, we intend to show how our architecture is able 
to quickly detect this and keep the system(s) safe, particularly
the physical systems. Since we don't care much about
what executes and are more concerned with how long some- 
thing executes, our "malicious entity" is a little more ab- 
stract as explained later in Sections 5.4.4 and 6.2. 

4. System Simplex Overview 

The Simplex Architecture [25] utilizes the idea of using 
simplicity to control complexity in order to safely use an un- 
trusted subsystem in a safety-critical control system. A Sim- 
plex system, shown in Figure 1, consists of three main com- 
ponents: 

a. under normal operating conditions the Complex
Controller actuates the plant; this controller has
high performance characteristics and is typically unverifiable
due to its complexity;

b. if, during this process, the system state becomes in 
danger of violating a safety condition, the Safety 
Controller takes over; 

c. the exact switching behavior is implemented within a 
Decision Module. 

The advantage of the design is that high-performance com- 
ponents can be used without the requirement that they be 
fully verified. By maintaining a correct safety controller and 
decision module the properties about the safety of the com- 
posite system can be guaranteed. Thus, even if the complex 
controller is upgraded, is faulty or becomes infected with 
malware, we are still assured that the formal safety proper- 
ties can never be violated and the plant remains safe. The 
Simplex architecture has been used to improve the safety of 
a fleet of remote-controlled cars [7], pacemakers [2] as well 
as advanced avionics systems [23]. 
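
The switching rule in items a-c can be illustrated with a minimal Python sketch. It is not taken from any Simplex implementation: the names `Envelope`, `decision_module` and `toy_predict`, the scalar plant state and the one-step prediction are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Hypothetical safety envelope: the plant state must stay in [low, high]."""
    low: float
    high: float

    def contains(self, state: float) -> bool:
        return self.low <= state <= self.high


def decision_module(state, complex_cmd, safety_cmd, envelope, predict):
    """Return the actuation command that is forwarded to the plant.

    The complex controller's command is used only while the current state
    and the state it is predicted to produce both lie inside the envelope;
    otherwise the safety controller's command is used.
    """
    if envelope.contains(state) and envelope.contains(predict(state, complex_cmd)):
        return complex_cmd
    return safety_cmd


def toy_predict(state, cmd):
    """Toy plant model (assumption): the command is added to the state."""
    return state + cmd


envelope = Envelope(low=-1.0, high=1.0)
# Complex command keeps the plant inside the envelope, so it is forwarded.
normal = decision_module(0.25, 0.25, -0.125, envelope, toy_predict)
# Complex command would leave the envelope, so the safety command is used.
unsafe = decision_module(0.75, 0.5, -0.125, envelope, toy_predict)
```

In this sketch only `decision_module` and the safety controller would need to be verified; the complex controller is treated as a black box whose output is checked before use.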

Early Simplex designs had all three subsystems located 
in software - at the application-level. To guarantee com- 
plete system safety, however, other components such as any 
middleware and the operating system need to behave cor- 
rectly. This requirement was relaxed in System-Level Sim- 
plex [2] by performing hardware/software partitioning on 
the system. In System-Level Simplex, the safety controller 
and the decision module are moved to a dedicated processing
unit (an FPGA) that is different from the microprocessor
running the complex controller. We leverage this partitioning
technique in S3A.

Untrusted Controller: One important question is the use 
of an unverified (and hence, untrusted) complex controller 
in such systems. It is not that designers wish to use unveri- 
fied controllers in such systems. Most such controllers that 
are intended to control anything but the simplest of systems 
are typically very complex and hard to verify. This is especially
true if they must also achieve high levels of performance.
Hence, there could be bugs and/or potential vulnerabilities
in the system that attackers could exploit. Even if we
assume that the controller is completely trusted, it can still 
be compromised (case in point - Stuxnet reprogrammed the 
controller in the PLC). Our technique can protect against 
any such intrusion, be it in trusted or untrusted controllers. 

System Upgrades: Another issue is what happens if the 
system must be updated and that process either (a) breaks 
the safety and timing properties of the system or (b) intro- 
duces malicious code. This is particularly important if such 
updates were to happen in a remote fashion. While these 
would be serious issues in most general-purpose or even 
mobile embedded systems (e.g. cell phones), it is not a prob- 
lem for safety-critical systems. As mentioned in Section 1, 
such systems are rarely updated, if at all. Also, any updates 
have the following properties: 
Figure 1: Simplex Architecture

i. updates are never performed remotely - they are car- 
ried out by trusted engineers; 

ii. most updates are minor in that they only tune cer- 
tain parameters and rarely, if at all, modify the con- 
trol/timing structure of the code - hence they will not 
even modify the safety properties of the system and 

iii. any major changes will require extensive redesign, 
testing, etc. - hence the safety and real-time properties
of the system must then be re-analyzed anyway.

One other important point is that the Simplex architecture 
can actually support upgrades to the complex controllers 
[24] in a safe manner. 

Our application of Simplex in S3A, in this paper, has 
several significant differences compared with earlier ap- 
proaches. In the past, the primary motivation to use Sim- 
plex was to aid in the verification of complex systems. In 
this work, we instead apply Simplex to protect against malware
that has infected the complex controller. Another key
difference is that previously the decision module's behav- 
ior was determined completely by the physical state of the 
plant. In this work, we widen the scope of the "correct state" 
by using side channels from the computational part of the 
system, such as the timing properties of executing real-time 
tasks, in order to determine when to perform the switching.
The Simplex decision module now monitors both
the physical system and the cyber state of the computational
system.

5. Integrated Framework for Security: Secure System Simplex Architecture (S3A)

We now present the Secure System Simplex Architecture 
(S3A) that prevents damage from malicious intrusions in 
safety-critical systems as well as aids in rapid detection 
through side-channel monitoring. In this section, we first 
elaborate on the high-level logical framework of the archi- 
tecture. We then discuss aspects of the execution time-based 
side channels that we have implemented in our S3A proto- 
type and then follow it up with details on how to implement 
such a system - from the hardware aspects to the OS modi- 
fications; from the timing measurements to the control sys- 
tem that we use to show the effectiveness of our approach. 



Figure 2: S3A Architecture

5.1. High Level Architecture

Figure 2 provides a high level overview of the system ar- 
chitecture. There is a Complex Controller that com- 
putes the control logic under normal operations. The com- 
puted actuation command is sent to the plant and sensor 
readings are produced and given to the controller to enable 
feedback control. There is also a Decision Module and 
Safety Controller in this architecture that are used 
not only to prevent damage to the plant in case of con- 
troller code bugs (as with the traditional Simplex appli- 
cations) but also to prevent plant damage in the case of 
malicious actuation from attackers. We also have a Side 
Channel Monitor that examines the execution of the 
Complex Controller for changes in 'expected' be- 
havior (in this paper it monitors the execution time of the 
Complex Controller to see if there is any deviation 
from what is expected). If the information obtained via the 
side channels differs from the expected model(s) of the sys- 
tem, the Decision Module is informed and control is 
switched to the Safety Controller (and an alarm can 
be raised). The types of side channels we can consider in 
a CPU-based embedded system include the execution time 
profiles of tasks, the number of instructions executed, the 
memory footprint and usage pattern or even the external 
communication pattern of the task. We will discuss timing 
side channels in more detail in Section 5.2 and elaborate
on the viability of the others in Sections 5.3 and 9.

This approach is qualitatively more difficult to attack 
than a typical control system. An attacker not only has to 
compromise the main system, but she also has to replicate 
all the side channels that are currently being monitored. If 
the timing of the task execution is being monitored then 
the attacker must replicate the timing profile of a correctly- 
functioning system. If the cycle count is being observed, her 
attack must also be sure to execute for a believable number 
of instructions. Even if all the side channels match the ex- 
pected models, the Decision Module will still monitor the 
plant state and, when malicious actuation occurs, prevent 
system damage. 
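
The combined check described above, where a side-channel anomaly or an unsafe plant state hands control to the Safety Controller, might be sketched as follows. This is an illustrative Python model only: the class and method names are hypothetical, and the choice to latch onto the safety controller after a side-channel alarm is an assumption of this sketch rather than a detail stated in the text.

```python
from enum import Enum


class Source(Enum):
    COMPLEX = "complex"
    SAFETY = "safety"


class S3ADecisionModule:
    """Hypothetical decision module combining cyber and physical state."""

    def __init__(self, plant_is_safe):
        self._plant_is_safe = plant_is_safe  # callable: state -> bool
        self.alarm = False                   # set once a side channel deviates

    def select(self, state, side_channel_ok, complex_cmd, safety_cmd):
        """Pick which controller's command actuates the plant this cycle."""
        if not side_channel_ok:
            # Assumption: an intrusion alarm is latched until an operator resets it.
            self.alarm = True
        if self.alarm or not self._plant_is_safe(state):
            return Source.SAFETY, safety_cmd
        return Source.COMPLEX, complex_cmd


dm = S3ADecisionModule(plant_is_safe=lambda s: abs(s) < 1.0)
first = dm.select(0.0, True, 5, 1)    # everything nominal
second = dm.select(0.0, False, 5, 1)  # timing model violated
third = dm.select(0.0, True, 5, 1)    # timing looks fine again, alarm still latched
```

Even when an attacker reproduces every monitored side channel, the `plant_is_safe` check still applies, which mirrors the last sentence of the paragraph above.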

The effectiveness of the side channel early-detection 
methodology depends on two factors. First, the constructed 
model of each side channel should restrict valid system behavior
(not easily replicable). Second, the side channel itself
must be secure (not easily forgeable). These factors are
implementation specific and will be discussed later in Sec- 
tion 5.4. 

5.2. Timing Side Channels 

In this paper, we intend to secure a real-time embedded 
system. Therefore, we assume that the system has typical 
real-time characteristics, i.e. the system is divided into a set 
of periodic tasks managed by a real-time scheduler. Each
task has a known execution time and each task periodically 
activates a job. 

The monitoring module maintains a real-time timing 
model of the system. Violations of this timing model occur
when the

i. job execution time is too large; 

ii. job execution time is too small; 

iii. job activation period is too large; or 

iv. job activation period is too small. 

Additionally, the monitoring module needs to examine the 
execution of the idle task. This prevents a malicious at- 
tacker from allowing the real-time task to execute normally 
and perform malicious activity during idle time. Finally, the 
monitoring module should be cognizant of the system ac- 
tivities that may result in timing perturbations. 
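
The four violation conditions listed above can be sketched as a small checker. This is an illustrative Python sketch, not the FPGA implementation: `TimingModel`, `check_jobs` and all numeric bounds are hypothetical, timestamps are assumed to be integer microseconds, and idle-task monitoring is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingModel:
    """Hypothetical per-task timing bounds, all in microseconds."""
    min_exec: int
    max_exec: int
    min_period: int
    max_period: int


def check_jobs(jobs, model):
    """Check a list of (start, end) job timestamps against the model.

    Returns a list of (job_index, violation) tuples, one per detected
    violation, in the order the jobs were observed.
    """
    violations = []
    prev_start = None
    for i, (start, end) in enumerate(jobs):
        exec_time = end - start
        if exec_time > model.max_exec:
            violations.append((i, "execution time too large"))
        elif exec_time < model.min_exec:
            violations.append((i, "execution time too small"))
        if prev_start is not None:
            period = start - prev_start
            if period > model.max_period:
                violations.append((i, "activation period too large"))
            elif period < model.min_period:
                violations.append((i, "activation period too small"))
        prev_start = start
    return violations


model = TimingModel(min_exec=95, max_exec=105, min_period=9990, max_period=10010)
jobs = [(0, 100), (10000, 10100), (20000, 20130), (29000, 29100)]
found = check_jobs(jobs, model)
```

Here the third job runs 30 μs longer than the model allows and the fourth job is activated 1000 μs early, so both are flagged.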

In our prototype, we address two of these timing side 
channel requirements: monitoring the control task and the 
idle task. For rapid prototype development, we eliminate 
system noise (disable interrupts) while our control task is 
running to obtain a predictable timing environment (see Section 5.4.5) rather
than patching system interrupts in order to receive their tim- 
ing information. In a real-time system the interrupts would 
be predictable and scheduled deterministically - hence we 
would be able to monitor them as well as other tasks. This 
addition, however, could be made to our prototype in the fu- 
ture. 

Execution times of the various real-time tasks in such 
systems are obtained anyway as part of system design by
a variety of methods [32]. There is no extra effort that we 
have to perform to obtain this information. The worst-case, 
best-case and average-case behavior for most real-time sys- 
tems is calculated ahead of time to ensure that all resource 
and schedulability requirements will be met during system 
operation. We use this knowledge of execution profiles to 
our advantage in S3A. 
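
As a rough illustration of how such an execution profile could be turned into a monitoring window, the sketch below derives best-, worst- and average-case values from measured samples. The function names, the sample values and the `margin_us` parameter are hypothetical. Note also that a measured maximum is only an estimate of the worst case; the timing analysis methods surveyed in [32] are what a real system would rely on.

```python
def build_profile(samples_us, margin_us=0):
    """Summarize measured execution times (microseconds) into a profile.

    `low` and `high` widen the observed range by `margin_us` to tolerate
    measurement jitter (the margin is an assumption of this sketch).
    """
    if not samples_us:
        raise ValueError("at least one sample is required")
    best = min(samples_us)
    worst = max(samples_us)
    return {
        "bcet": best,                              # best-case execution time
        "wcet": worst,                             # worst-case (as measured)
        "acet": sum(samples_us) / len(samples_us), # average-case
        "low": best - margin_us,
        "high": worst + margin_us,
    }


def within_profile(profile, observed_us):
    """True if an observed execution time falls inside the accepted window."""
    return profile["low"] <= observed_us <= profile["high"]


profile = build_profile([100, 102, 98, 101, 99], margin_us=1)
accepted = within_profile(profile, 103)  # inside the widened window
rejected = within_profile(profile, 104)  # one microsecond too long
```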

5.3. Other Potential Time-based Side Channels 

In the assumed context of predictable real-time embedded
control systems, several other side channels are available
as part of the cyber state. These include the task activation
periodicity, memory footprint, bus access times and durations,
scheduler events, etc. Each of these is a candidate
for benevolent side-channels that can be monitored to de- 
tect infections and would have to be individually detected 
and replicated by an attacker to maintain control in an in- 
fected system, thus qualitatively increasing the difficulty for 
future attackers. 

Additionally, the specific side channels used may vary 
depending on the type of system. E.g., in this paper, we 
focus on CPU-based real-time control systems. Other sys- 
tems, e.g. PLC-based systems, would likely need to either 
monitor the side channels using different mechanisms or 
utilize a completely different (or additional) set of side
channels. 

5.4. Implementation 

We now describe a prototype implementation of S3A 
that we have created. The technical details of the prototype 
are listed in Table 1. We will elaborate on key aspects of 
our implementation in detail in the upcoming subsections. 
First, a hardware component overview is provided in Sec- 
tion 5.4.1. Then, the inverted pendulum hardware (our ex- 
ample 'safety-critical control system') setup is described in 
Section 5.4.2. The methodology for timing measurements 
of the control code is described in Section 5.4.3 and the 
methodology for timing-variability ('malicious code') tests 
is presented in Section 5.4.4. Section 5.4.5 gives essential 
details about the operating system setup during the mea- 
surements. Finally, Section 5.4.6 describes the specific de- 
sign of the Decision Module and the timing Side 
Channel Monitor. 



Component                  Details
-------------------------  -------------------------------
Inverted Pendulum          Quanser IP01
FPGA                       Xilinx ML505
Computer with Controller   Intel Quad core 2.6 GHz
Operating System           Linux kernel ver. 2.6.36
Timing Profile             Intel Timestamp Counter (rdtsc)

Table 1: S3A Prototype Implementation Details
Details in Section 5.4.5.



Figure 3: S3A Implementation Overview
5.4.1. Hardware Components. A high-level hardware design
of our prototype is shown in Figure 3. The prototype
hardware instantiates the logical Secure System Simplex
architecture previously described in Section 5 and
shown in Figure 2. In our implementation, we run the 
Complex Controller on the main CPU. The Complex 
Controller communicates with a trusted hardware compo- 
nent, an FPGA in this case, to perform control of an inverted 
pendulum. Sensor readings are obtained by the FPGA over 
the PCIe bus using memory mapped I/O. The actuation 
command, in turn, is written to the memory-mapped re- 
gion on the FPGA. Additionally, timing messages in the 
form of memory-mapped writes are periodically sent to the 
FPGA based on the state of execution (at the start/end of 
the control task and periodically during the Idle Task). 
This creates a timing side channel that can be observed by a 
Timing Channel Monitor running on the FPGA. On 
the FPGA side, the Timing Channel Monitor will 
measure the time elapsed between timing messages from 
the Complex Controller to ensure that the execution 
conforms to an expected timing model. The Decision 
Module will periodically examine the output of the Timing 
Channel Monitor, the actuation command sent by the
Complex Controller from Shared Memory on the FPGA, 
the actuation command from the locally-running Safety 
Controller and the state of the plant from a Sensor 
and Actuator Interface and decide which con- 
troller's actuation command should be used - the complex 
one on the CPU or the safe one on the FPGA. The actuation 
command is then output back to the Sensor and Actuator
Interface. The interface then, through a digital-to-analog
converter, actuates the Plant - in our case, an inverted
pendulum. The Sensor and Actuator Interface also periodically
acquires sensor readings through analog-to-digital converters
and writes their values both to shared memory accessible
by the Complex Controller and to memory accessible
by the trusted Decision Module and Safety Controller.

5.4.2. Inverted Pendulum We used an inverted pendulum 
(IP) as the plant that was being controlled. An inverted pen- 
dulum, like the one shown in Figure 4, is a classic real-time 
control challenge where a rod must be maintained in an up- 
right position by moving a cart attached to the bottom of 
the pendulum along a one-dimensional track. There are two 
sensors to measure both the current pendulum angle as well 
as the cart position on the track and there is one actuator 
(the motor near the base of the pendulum) used to move 
the cart. Two safety invariants must be met: (1) the
pendulum must remain upright (it cannot fall over) and (2) the
cart must remain near the center of the track. The specific
inverted pendulum we used in our testbed was based on the
Quanser IP01 linear control challenge [11].

Our setup varies slightly from an off-the-shelf version of
the Quanser IP01. The most important difference is that we
needed to directly connect the sensors and actuators to the
FPGA hardware, whereas the prebuilt setup requires a computer
to do the data acquisition. We modified the system
to redirect the sensor values and motor commands through
an Arduino Uno microcontroller that communicates directly
with the S3A FPGA through a serial cable. Although this
change can potentially introduce latency into the system,
we did not observe any issues with safely actuating the
pendulum due to this small delay. The control code that
manages the IP executes on a computer (Section 5.4.5 and
Table 1). This control code executes at a frequency of 50 Hz.

Figure 4: An inverted pendulum control system maintains
an upright rod along a one-dimensional track.

Note: The inverted pendulum has been used extensively
in the literature as an example of a real-time control
system [2,25]. Hence we believe it suffices for demonstrating
an early prototype of our solutions. We are currently
working on applying these techniques to other real
control systems in conjunction with power vendors.

5.4.3. Timing The implementation of the complex controller
for the inverted pendulum is fairly simple, with very
few branches and most loops being statically decidable
(see footnote 4).
Hence it is fairly easy to calculate the execution time and 
number of instructions taken for such code. In our frame- 
work, we utilized simple dynamic timing analysis [32] 
methods to obtain an execution profile of the code. We 
used the Intel time stamp counter (rdtsc) [12] to obtain 
high resolution execution time measurements for the con- 
trol code. 

The profile consisted of the 'worst-case,' 'best-case' and
'steady-case' numbers for the control code, obtained by
executing it multiple times on the actual computer
where it would run and measuring each set of executions.
'Steady-case' refers to the values obtained when the
execution time has stabilized over multiple, repeated
executions - i.e. when the initial cold-cache-related timing
dilation at the start of the experiments no longer occurs.

4 This is typical of most control code in safety-critical and
real-time control systems - hence our implementation of the
controller for the inverted pendulum is also similar.

The control code was placed in a separate function and
called in a loop. As part of our experiments, the loop was
executed 1, 10, 100, 1,000, 10,000, 100,000 and 1,000,000
times. During each of these scenarios, the total time of the
loop as well as the time taken by each individual
iteration were measured. From these traces we were able
to determine the maximum (worst-case), minimum (best- 
case) and steady-case values for the execution time of the 
controller code. To reduce the noise from instrumentation 
and overheads of the loops, function calls, etc. we used 
the 'dual-loop timing' method: i.e., empty loops with only 
the measurement instrumentation were timed as a 'con- 
trol' experiment. The execution times obtained for these 
instrumentation-only loops were subtracted from the exe- 
cution times for the loops with the control code. 
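
The dual-loop idea can be illustrated with a minimal sketch. It is
not the measurement harness used for the prototype: it assumes a
hypothetical `control_step` function in place of the real controller,
uses Python's `time.perf_counter_ns` instead of rdtsc, and takes the
median as a rough proxy for the 'steady-case' value.

```python
import time
from statistics import median

def control_step(angle, position):
    """Hypothetical stand-in for the pendulum controller (linear feedback)."""
    return -(12.0 * angle + 3.0 * position)

def timed_loop(iterations, body=None):
    """Time every iteration; with body=None only the instrumentation is measured."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        if body is not None:
            body(0.01, 0.02)
        samples.append(time.perf_counter_ns() - start)
    return samples

def dual_loop_profile(iterations, body):
    """Subtract the instrumentation-only cost from each measured iteration."""
    overhead = median(timed_loop(iterations))      # the 'control' experiment
    corrected = [max(0, t - overhead) for t in timed_loop(iterations, body)]
    return {
        "overhead": overhead,
        "best": min(corrected),
        "steady": median(corrected),   # assumption: median approximates steady-case
        "worst": max(corrected),
    }

if __name__ == "__main__":
    print(dual_loop_profile(10_000, control_step))
```

On a general-purpose interpreter the absolute numbers are far noisier
than cycle counts from rdtsc; only the structure of the subtraction is
meant to carry over.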

Interrupts (all interrupts including inter-processor
interrupts) were disabled during the timing measurements. To
reduce the timing effects of the operating system and other
system issues, we isolated our controller as far as we could,
as described in Section 5.4.5.

While we used simple measurement-based schemes for 
obtaining the execution profile for the control code in this 
paper, it does not preclude the use of other more sophisti- 
cated analysis techniques [19,32] to obtain better (and safer) 
timing estimates. This is especially true if the code is more 
complex than the one used for the inverted pendulum. In 
fact, the better the estimation methods, the better S3A will 
be able to detect anomalies and intrusions. 

5.4.4. Execution Time Variation To mimic the effect of 
code modification on timing, we insert extra code into the 
execution of the control loop function described in Section 
5.4.3. Specifically, the extra code is a loop with a varying
upper bound (i.e. 1, 10, 100) that performs multiple
arithmetic operations (floating point and integer). The idea
is that the extra time/instructions that execute will make
it look like an intrusion has taken place. Our S3A system
will then detect the additional execution, raise an alarm and 
transfer control to the simple controller on the FPGA. 

Note: As mentioned before, we are less interested in 
what kind of code executes "maliciously" because our de- 
tection scheme does not depend on this detail. We only need 
to check whether whatever is executing has modified the 
timing profile of the system. 
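
A minimal sketch of this kind of injection follows. The function
names and the particular arithmetic are assumptions made for
illustration; the point is only that the actuation command is
unchanged while the execution time grows with the loop bound.

```python
def control_step(angle, position):
    """Hypothetical stand-in for the controller: linear state feedback."""
    return -(12.0 * angle + 3.0 * position)

def extra_work(bound):
    """Injected loop: floating point and integer arithmetic, result unused."""
    f_acc, i_acc = 0.0, 0
    for k in range(bound):
        f_acc = f_acc * 1.0001 + 0.5 * k
        i_acc = (i_acc + 3 * k) % 1_000_003
    return f_acc, i_acc

def infected_control_step(angle, position, bound):
    """Same actuation command as control_step, but with extra execution time."""
    extra_work(bound)
    return control_step(angle, position)

if __name__ == "__main__":
    for bound in (1, 10, 100):
        same = infected_control_step(0.01, 0.02, bound) == control_step(0.01, 0.02)
        print(bound, same)
```

Because the output is identical, a monitor that looks only at the
actuation command cannot tell the two versions apart; the timing
profile is the only observable difference in this sketch.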

5.4.5. System and OS Setup As stated in Table 1, we 
used an off-the-shelf multi-core platform running Linux 
kernel 2.6.36 for our experiments. Since we use a COTS 
system there are many potential sources of timing noise 
such as cache interference, interrupts, kernel threads and 
other processes that must be removed for our measurements 
to be meaningful. In this section we describe the
configuration we used to best emulate a typical uni-processor
embedded real-time platform.

The CPU we used is an Intel Q6700 chip that has four 
cores and each pair of cores shares a common level two 
(last level) cache. We divided the four cores into two parti- 
tions: 

1. the system partition running on the first pair of cores 
(sharing one of the two L2 caches) handles all inter- 
rupts for non-critical devices (e.g., the keyboard) and 
runs all the operating system activities and non real- 
time processes (e.g., the shell we use to run the exper- 
iments); 

2. the real-time partition runs on the second pair of cores 
(sharing the second L2 cache). One core in the real- 
time partition runs our real-time tasks together with the 
driver for the trusted FPGA component; the other core 
is turned off so that we avoid any L2 cache interfer- 
ence among these two cores. 

5.4.6. Detection In our implementation, detection of
malicious code can occur in one of two ways. The decision
module observes both (i) the physical state of the plant (by
traditional Simplex) as well as (ii) the computation state of
the system (based on timing messages; S3A). A violation
of the physical model or the computational model can trig- 
ger the decision module to switch control to the safety con- 
troller on the FPGA. The physical model is monitored as 
described in previous work [3,4,25]. Based on a function of 
the track position and pendulum angle, the decision mod- 
ule may choose to switch over to the safety controller. 

The computational system is also monitored for violations
of the expected timing model of the system. Both the
control task and the idle task are monitored; each
periodically sends timing messages to the FPGA. The FPGA
contains an expected timing model of the system in the form
of a finite state machine (FSM) running in hardware. When
timing messages arrive, or timers expire, the finite state
machine state can advance. If malicious code were to execute,
it would have a limited window of time to replicate the
timing side channel before it was detected by the Simplex
module.

Figure 5: FSM for Detecting Timing Model Violations

Generally speaking, monitoring the timing of a real-time 
system can be performed by maintaining state about each 
task in the system. Each task would have two timers:

I. the first would enforce the execution time of the task
II. the second would monitor the periodic activation of the task

Furthermore, a stack would be used to track task preemp- 
tions. Since typical real-time systems use priority-based ex- 
ecution, all task switches will be directly observable by the 
FPGA through task start/task end messages. 
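
The general scheme can be sketched in software as follows. This is
an illustration under stated assumptions, not the hardware design:
the class and method names are hypothetical, times are plain integers,
and only the execution-time timer is modelled (the period timer is
omitted for brevity). The stack ensures that a preempted task is not
charged for the time consumed by the task that preempted it.

```python
class PreemptionTracker:
    """Charges elapsed time to the task on top of the stack (hypothetical sketch)."""

    def __init__(self, budgets):
        self.budgets = budgets                  # task name -> max execution time
        self.used = {task: 0 for task in budgets}
        self.stack = []                         # running task is stack[-1]
        self.last = 0                           # time of the last message

    def _charge(self, now):
        if self.stack:
            self.used[self.stack[-1]] += now - self.last
        self.last = now

    def start(self, task, now):
        """Task start message: the current task (if any) is preempted."""
        self._charge(now)
        self.stack.append(task)
        self.used[task] = 0

    def end(self, task, now):
        """Task end message: returns False on a timing or ordering violation."""
        self._charge(now)
        if not self.stack or self.stack[-1] != task:
            return False                        # end message from a task not running
        self.stack.pop()
        return self.used[task] <= self.budgets[task]

if __name__ == "__main__":
    t = PreemptionTracker({"low": 10, "high": 3})
    t.start("low", 0)
    t.start("high", 4)          # 'low' is preempted after running for 4
    print(t.end("high", 6))     # 'high' ran for 2 (budget 3)
    print(t.end("low", 11))     # 'low' ran for 4 + 5 = 9 (budget 10)
```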

For our specific prototype, we implemented, in hardware,
the finite state machine on an FPGA as shown in Figure 5.
In our system there are two tasks: (i) the idle task and
(ii) the controller task. Since only a single task may be
preempted (the idle task), we maintain a single variable as the
call stack, state_I. Three timers are present: clk_C and clk_P
maintain the execution time and period of the control task,
and clk_I maintains the execution time of the idle task. In
the figure, clk_C ticks while the control task is running
(states C1 and C2) and clk_I ticks while the idle task is
executing (states I1 and I2). clk_P always ticks. The finite
state machine is parameterized with six values: MustWait_C,
CanWait_C, MustWait_I, CanWait_I, MustWait_P, and CanWait_P.
These values are determined by the minimum and maximum time
permitted between timing messages. The MustWait time
indicates the minimum time that must elapse, whereas the
CanWait time indicates the jitter permitted between different
iterations of the loop. To say it another way, MustWait is
the minimum execution time of the task/idle loop/period,
whereas (MustWait + CanWait) is the maximum execution
time.
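
The relationship between the two parameters and the permitted
minimum and maximum times can be written out as a short sketch. The
function names are hypothetical; the example values are the enforced
iteration bounds reported in Table 2 (4.6 and 5.7 microseconds),
expressed in nanoseconds to keep the arithmetic exact.

```python
def wait_parameters(min_time, max_time):
    """Derive (MustWait, CanWait) from the permitted minimum and maximum times."""
    if max_time < min_time:
        raise ValueError("max_time must not be smaller than min_time")
    return min_time, max_time - min_time

def message_allowed(elapsed, must_wait, can_wait):
    """A timing message is legal only inside [MustWait, MustWait + CanWait]."""
    return must_wait <= elapsed <= must_wait + can_wait

if __name__ == "__main__":
    must_wait, can_wait = wait_parameters(4_600, 5_700)   # nanoseconds
    print(must_wait, can_wait)
    print(message_allowed(5_000, must_wait, can_wait))    # inside the window
    print(message_allowed(5_701, must_wait, can_wait))    # too late
```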

In the finite state machine, initially the control task
is running. The machine remains in state C1 until clk_C
ticks from MustWait_C to 0. Then state C2 is entered. If clk_C
ticks from CanWait_C to 0 without the end task message, the
control task has executed for too long and a timing violation
occurs (indicated by the dotted arrow in state C2). Once the
end control task message is received, the idle task begins to
execute. Under normal operation, the state will change
between I1 and I2 several times, until the control task is
reactivated and state C1 is again entered. Any messages that
arrive without explicit transitions in the timing FSM are
interpreted as errors in the prototype and trigger the decision
module to switch to the safety controller. Additionally, the
dotted transitions in the FSM are timing violations that also
trigger the decision module to take corrective action.
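
A simplified software model of this behavior is sketched below. It
is not the hardware FSM of Figure 5: the message names are
hypothetical, the sub-states (C1/C2 and I1/I2) are collapsed into a
single window check per task, the period timer is omitted, and the
idle-task window values in the example are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Window:
    must_wait: int   # minimum time before the next message is legal
    can_wait: int    # additional jitter tolerated after must_wait

    def too_early(self, elapsed):
        return elapsed < self.must_wait

    def too_late(self, elapsed):
        return elapsed > self.must_wait + self.can_wait

class TimingMonitor:
    LEGAL = {"control": {"end_control"}, "idle": {"idle", "start_control"}}

    def __init__(self, control, idle, start=0):
        self.windows = {"control": control, "idle": idle}
        self.task = "control"    # the control task runs first, as in the text
        self.since = start       # time of the last accepted message
        self.violation = False   # sticky: once set, control stays with the safety side

    def tick(self, now):
        """Timer expiry: no message arrived within MustWait + CanWait."""
        if self.windows[self.task].too_late(now - self.since):
            self.violation = True
        return self.violation

    def message(self, kind, now):
        """Process a timing message; returns True if a violation is flagged."""
        if self.tick(now):
            return True
        window = self.windows[self.task]
        if kind not in self.LEGAL[self.task]:
            self.violation = True            # no explicit transition for this message
        elif kind != "start_control" and window.too_early(now - self.since):
            self.violation = True            # message arrived before MustWait elapsed
        else:
            self.task = "control" if kind == "start_control" else "idle"
            self.since = now
        return self.violation

if __name__ == "__main__":
    m = TimingMonitor(Window(4_600, 1_100), Window(100, 100))   # nanoseconds
    print(m.message("end_control", 5_000))     # within the control window
    print(m.message("idle", 5_150))            # idle loop message on time
    print(m.message("start_control", 5_200))   # control task reactivated
    print(m.message("end_control", 11_200))    # ran for 6,000 ns: too long
```

In this sketch a flagged violation stands for the point at which the
decision module would hand control to the safety controller.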

The FSM can also be used to tightly track the execution 
behavior of the code for more sophisticated controllers, e.g. 
if the control code has many branches, function calls, etc. 
For instance, when the control code reaches a branch that
affects the overall execution time, a message can be sent to
the FSM about which side of the branch was taken. The FSM
can now use this information to accurately track the execu- 
tion of the program for all control constructs in the code. 

6. Evaluation 

In this section we evaluate the Secure System Simplex
architecture. First we present timing results that we
obtained by analysis of the controller code (Sections 6.1 and
6.2); these values are used to form the profile of the
execution behavior that is then used in the intrusion detection
mechanism on the FPGA. We then present the details of the
intrusion detection in Section 6.3.

6.1. Timing Results and Execution Profile

Figure 6: Summary of the Timing Results

Figure 6 shows a high level summary of the timing ex- 
periments used to obtain the execution profile of the com- 
plex controller code that executes on the computer (Fig- 
ure 2). We used dynamic timing analysis techniques to ob- 
tain the worst, best and steady state execution times for this 
code. The x-axis represents the number of times the
controller code was repeatedly executed: from 1 to 1,000,000,
increasing by a factor of 10. The y-axis represents the
execution time in
cycles. Each grouping of vertical lines represents the 'worst- 
case', 'steady-state' and 'best-case' execution times for that 
experiment. 'Steady-state' refers to the execution time when 
successive executions of the controller code resulted in the 
same execution time - i.e. the situation when the execution 
reached a steady state. This is compared to the first few it- 
erations, when cache and other hardware effects would re- 
sult in a higher variance in the execution time of the code 
- the 'worst-case' numbers in the graph are usually from 
these first few iterations before the system effects (in par- 
ticular the cache) have settled down. This is the reason why 
there exists a slightly larger difference between the worst- 
case and best-case numbers. 

Each vertical bar is split into two parts - the lower part
shows the instrumentation overhead for that experiment
(see footnote 5), while the top part represents the pure
timing for the control code only. We also see that the
instrumentation overhead is almost the same across all
experiments - oscillating between 260 and 270 cycles.

As seen in the graph, the steady-state and best-case values
are very close, not just within the same experiment, but
across experiments. The largest difference between the two
is 360 cycles for the n = 100,000 experiment. This supports
our assumption that controller code in safety-critical
systems is simple and has little variability.
This lack of variability is also evident from the fact that
the worst-case execution cycles, across experiments, do not
show much variance. The worst-case value for the last
experiment (1,000,000) is slightly higher at 16,560 cycles,
which can be attributed to the initial cold cache and
other system effects.

Figure 7 shows the execution profile for one timing
experiment in particular - that of 100,000 iterations. The
x-axis is the iteration number while the y-axis is the
number of cycles for each iteration. As this figure shows, the
first few iterations take a little longer (around 17K cycles)
and then most of the execution stabilizes to within a
narrow band of:

14,660 - 13,070 = 1,590 cycles
i.e. approximately 0.6 µs at 2.67 GHz

Hence, this band will define the 'accepted range' of values 
that the FPGA will use to check for intrusions. Any execu- 
tion that changes the steady state execution time by more 
than this narrow range will be caught by the FPGA. In fact, 
the FPGA will catch variance in either direction - i.e. an in- 
crease as well as a decrease in execution time. 
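
The conversion from cycles to time, and the resulting range check,
can be made explicit with a short sketch. The band limits and clock
frequency are the values quoted above; the function names are
hypothetical and the check is a software illustration, not the FPGA
logic.

```python
CPU_HZ = 2.67e9                     # clock frequency quoted in the text
BAND_LOW, BAND_HIGH = 13_070, 14_660   # steady-state band, in cycles

def cycles_to_us(cycles):
    """Convert a cycle count to microseconds at CPU_HZ."""
    return cycles / CPU_HZ * 1e6

def in_accepted_range(cycles):
    """True if an iteration's cycle count lies inside the steady-state band."""
    return BAND_LOW <= cycles <= BAND_HIGH

if __name__ == "__main__":
    width = BAND_HIGH - BAND_LOW
    print(width, round(cycles_to_us(width), 2))   # band width in cycles and in µs
    print(in_accepted_range(14_000))              # typical steady-state iteration
    print(in_accepted_range(17_000))              # cold-start iteration, outside band
    print(in_accepted_range(12_000))              # too fast is also flagged
```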

The graph also shows that while the majority of execution
times fall within a small band at the lower end of the
above-mentioned range, some values also fall into a narrow
band at the top of the range (i.e. around the 14K value). This
narrow band of increased execution times for some
experiments is due to latent system effects that we were not
able to remove. The main culprit is the last level cache that,
in this architecture, uses a random replacement policy. Hence,
every once in a while a few of our controller's cache lines are
evicted by periodic kernel threads that we could not easily
disable (since we are running a COTS operating system)
and these iterations take a few hundred cycles extra
(anywhere from 500 to 900) to execute. With a more predictable
cache replacement policy, like the ones used in hard
real-time systems, we would not see this behavior. To test this
theory we ran the same measurements on a PowerPC that
uses a pseudo-LRU (Least Recently Used) cache replacement
policy in its last level, and all the points are clustered into
a single band. In fact, with LRU, tasks would not evict each
other's cache lines, unless the cache is not big enough to fit
them at the same time (see footnote 6).

Figure 7 also shows a few sporadic experiments exhibit- 
ing much higher execution times. Again, this is due to sys- 
tem effects, and in particular, contention on the bus when 
communicating with the FPGA. As explained in Section 
5.4.1, the complex controller reads and writes messages to 
and from the FPGA to control the pendulum and to send 
the timing messages. Often, while the complex controller
is waiting for data from the inverted pendulum (angle
and track position) that arrives on the common bus, the
incoming messages experience unpredictable delays. These
delays are due to bus contention among the FPGA and other 
peripherals sharing the same bus. 

To confirm that the communication with the FPGA was
the cause of these effects, we conducted timing experiments
where the FPGA was switched off and all calls to communicate
with it (read/write) resulted in null function calls.
Figure 8 shows the results of these experiments for the 100,000
iterations point. This experiment highlights two important
points:

• the random spikes at higher values no longer exist, thus
showing that the bus contention due to communication
with the FPGA was the main cause of the spikes;

• the same 'double-band' of execution results also
appears here; the interesting fact is that the gap between
the bands is almost exactly identical to that of Figure 7,
thus providing more evidence that the cache
(and its replacement policy) is the culprit.

5 As explained in Section 5.4.3, we used dual-loop timing
techniques to obtain the overheads due to the instrumentation.

6 If this is the case, we just have to account for it when we
compute the execution time for each task.

Such issues could be avoided in an actual hard real-time 
system instead of the COTS-based experimental setup that 
we use here. In fact, a hard real-time system would use a
more predictable bus, or other techniques [5], that allow
designers to bound I/O contention and avoid random spikes.

6.2. Malicious Code Execution Results

Figure 9: Execution Cycles for Malicious Code Execution

We introduce "malicious code" by inserting extra in- 
structions (Section 5.4.4) - i.e. a loop of variable size within 
the complex controller code. The upper bounds for the ma- 
licious loop are one of 1, 10, 100 - we stopped at the upper 
limit of 100 since anything over this value would put the ex- 
ecution of the "infected" control code over the real-time pe- 
riod of the task. Also, as we will see soon, even these small 
additional increases in execution times will be caught by the 
monitoring framework of S3A. 

Figure 9 shows the execution time (in cycles on the
y-axis) taken up by the code for each value of the malicious
loop bound (x-axis). The final bar in the graph represents the
"base," i.e. the number of execution cycles taken up by the
controller code without any malicious loop. As expected,
the values for the malicious code increase significantly
with each increase in the loop bound. Even the smallest
malicious loop puts the execution time outside of the
narrow range (0.6 µs) explained in Section 6.1. Hence, even
this will be caught by S3A and control will be transferred
over to the simple controller executing on the FPGA. Note:
Since we are not concerned with what executes as part of the
malicious code and intend only to catch variations in execution
time, we only mimic the increased execution time effects by
the methods discussed in this section.

6.3. Intrusion Detection 

In this section, we describe the evaluation of our tim- 
ing side channel intrusion detection technique. First, we de- 
scribe measurements of the timing of key aspects of our 
overall architecture. Then, we demonstrate the early
detection of malicious code execution using the timing side
channel approach compared with monitoring the plant state only
(say, by use of traditional Simplex). The results of our
intrusion detection measurements are summarized in Table 2.

Our first timing measurement was to check the overhead
of sending timing messages to the FPGA. Although the
message itself takes time to propagate through the PCIe bus to
the FPGA, the CPU is not stalled during this time. By
using the time stamp counter we measured the overhead on the
CPU for sending a single timing message to be 130 cycles
(50 nanoseconds). This time is extremely small and
therefore each process could realistically send multiple messages
during a single iteration of each control loop to reduce the
time an attacker has to replicate the timing side channel.
Another advantage of having multiple timing messages per
iteration is that if the program contains branches, we could
communicate to the FPGA timing monitor (at run-time)
information about which branch was taken, thus allowing for
tighter monitoring of the timing requirements in the timing
model FSM.

The second timing measurement quantified the jitter of the
timing messages through the interconnect going to the FPGA.
We performed this measurement by recording the difference
between the arrival of the start control iteration timing
message and the end control iteration timing message, in the
FPGA, over several thousand iterations of the control loop.
The reason for this jitter is twofold:
(a) one source is the jitter of the execution time itself (the
difference between the minimum and maximum execution
time as shown in Figure 5) and (b) the second source of
jitter is the varying time of message propagation through the
PCIe bus.

Since our testbed was a multicore system, processes
running concurrently on other cores as well as other
independent bus masters such as peripherals may cause
interference on the shared interconnect. As noted earlier, in a
deployed real-time control system such noise would not be
present, or would at least be bounded; our testbed, on the
other hand, was essentially an off-the-shelf installation of
Linux running on COTS hardware. Nonetheless, we measured
the typical timing variation caused by the interconnect
to be about 0.6 microseconds, or less than one eighth
of the iteration time of a single control iteration in this case.

Measured Quantity                                 Time (µs)
------------------------------------------------  ---------
Control Task Execution Time for single iteration  4.8 - 5.4
Interconnect Extra Jitter                         ~0.6
Enforced Iteration Time                           4.6 - 5.7
Timing Anomaly Detection Time
  (for Inverted Pendulum system)                  5.7
Vanilla Simplex Anomaly Detection Time            10,000
Timing Message CPU Overhead                       0.05

Table 2: Measured Timings during Intrusion Detection



Any malicious code that increases the execution time of the
task by more than this amount would be detected by the
FPGA timing monitor. As shown in Table 2, we can now
detect an intrusion using the timing-based side channel within
5.7 µs, and anything that changes the timing by 0.6 µs would
be caught. Furthermore, we could also add multiple timing
messages in each control iteration (since the CPU message
overhead is so low) to further reduce the maximum
intrusion detection delay.

In Table 2, the control task execution time was obtained
from the execution time measurements from
Section 6.1. The values in the table are in absolute time,
converted from the cycle count measurements
we performed. Hence, the 4.8 - 5.4 µs value for the
'Control Task Execution Time' is obtained from the (approx.)
13,000 - 14,000 cycles that we discussed in Section 6.1
and Figure 7.

Due to the extra jitter caused by the interconnect, the
enforced iteration time is, as expected, larger than the measured
control task execution time. The maximum enforced
iteration time, 5.7 µs, is the maximum time the experimental
framework can proceed without a timing message before
the safety controller takes over. To state it another way, in
the FSM in Figure 5, the runtime value of MustWait_C
is 4.6 µs, and the runtime value of CanWait_C is about
1.1 µs (MustWait_I and CanWait_I are much lower). Given
those numbers, the side-channel monitor FSM will detect a
missed timing message within 5.7 µs, i.e. the detection time
reported.

We now compare the early detection of malicious code
through timing side channels by the decision module of
S3A with the situation when it only
monitors the plant state (vanilla Simplex). In the timing side
channel version, as discussed above, the maximum time that
can elapse without valid timing messages is 5.7 µs.
For the vanilla Simplex version, we experimentally
measured the amount of time needed to detect an intrusion.
After taking control of the system, we immediately tried to
destabilize the pendulum by sending a maximum voltage
value in the direction which would most quickly collapse
the pendulum (in order to see a lower bound on the
detection time when plant state is monitored alone). In this
experiment, we were able to detect an intrusion after 5
control iterations, or 100 milliseconds. It is clear that the use
of timing side channels enables significantly faster
detection of intrusions in real-time control systems:
over four orders of magnitude faster than with traditional
Simplex.
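
The size of this gap follows directly from the figures quoted in the
text (a 50 Hz control loop, detection after 5 iterations for vanilla
Simplex, and 5.7 µs for the timing side channel). The short sketch
below only reproduces that arithmetic; the variable names are
hypothetical.

```python
import math

CONTROL_RATE_HZ = 50          # control loop frequency (Section 5.4.2)
SIMPLEX_ITERATIONS = 5        # iterations before plant-state detection
SIDE_CHANNEL_US = 5.7         # timing side channel detection time

period_us = 1_000_000 // CONTROL_RATE_HZ          # 20,000 µs per iteration
simplex_us = SIMPLEX_ITERATIONS * period_us       # 100,000 µs = 100 ms
ratio = simplex_us / SIDE_CHANNEL_US

if __name__ == "__main__":
    print(simplex_us)
    print(round(ratio), round(math.log10(ratio), 2))
```

The ratio is roughly 17,500, i.e. a little more than four orders of
magnitude.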

7. Limitations 

The proposed S3A Architecture is not a silver bullet for 
intrusion detection in embedded control systems and does 
have some practical restrictions which may limit its appli- 
cability. Firstly, in order to use the Secure System Simplex 
Architecture in a real system, the system needs to be de- 
signed with the architecture in mind. If there is no way to in- 
sert a decision module between the controller and the plant 
then the architecture can not be used. While this is a limita- 
tion for some existing systems, we think that the design of 
future systems could provision for such techniques; after all 
it is never a bad idea to consider security aspects when de- 
signing a new system. 

One concern regarding the correctness of the approach is
making sure that an attacker cannot easily replicate our side
channels. For example, if a processor instruction count side
channel is created by naively sending the current instruction
count value to the Monitoring Module then a malicious
entity could easily store and then replay these values. These
types of restrictions could be overcome with minor
modifications to the processor architecture. In this instance,
allowing the FPGA to directly access the instruction count
without involving explicit communication from the CPU would
eliminate the possibility of spoofing.

Additionally, for each side channel, a model of the cor- 
rect behavior must be created that would restrict a mali- 
cious program. For our timing side channel, one problem 
could be that the execution of the task has too much of 
a difference between the minimum and maximum execu- 
tion times to provide real restrictions on system behavior. 
While this could be the case in general purpose systems, it 
is not very likely in CPS with real-time constraints. Even so, 
this could be overcome at runtime by having each timing- 
behavior-modifying branch point send a timing message to 
the FPGA indicating what path was taken. This would per- 
mit an extremely tight bound to be placed on the execu- 
tion time at the expense of a more complicated state ma- 
chine to detect timing anomalies. The construction and tun- 
ing of the timing parameters of the state machine is also 
currently a manual process. We believe this could
eventually become a more automatic step in the procedure by
performing a compile-time analysis of the control flow graph
of the code - indicating where to send the timing messages -
and using run-time analysis to perform precise timing
measurements.

The implementation of the trusted FPGA hardware in our 
framework must be correct for the system to be secure. This
may seem like we have just moved the problem over to
securing the FPGA system instead of the main system, but this
is not exactly the same for the following reason: the FPGA 
and Safety Controller only need to maintain the safety of 
the plant. The Complex Controller, on the other hand, can 
perform useful work with the plant so any upgrades will 
be made to the Complex Controller and not to the FPGA's 
safety logic. The Complex Controller's timing profile would 
need to be upgraded but that could be done in a restricted 
way to prevent modification to the Safety Controller and 
Decision Module. Of course, we should not permit FPGA 
reconfiguration at runtime and the trusted hardware plat- 
form could even be created on an Application-Specific In- 
tegrated Circuit (ASIC) instead of an FPGA in order to en- 
force this. 

One issue related to the use of FPGAs in such systems 
is that sometimes the complex controller might require the 
use of complex floating point calculations and such floating- 
point computation units are typically not present on FPGAs 
since they use up significant area. The FPGA in our archi- 
tecture is used as a rapid prototype of the trusted simplex 
component. A deployment implementation would likely use 
a trusted microcontroller along with any capabilities (float- 
ing point unit) that are needed for the safety controller, deci- 
sion module, and side-channel monitor. Also, as mentioned 
before, the FPGA will only host the safety controller that 
maintains bare functionality. Hence, it is unlikely that it will 
need to perform fancy floating point calculations. 

Finally, the original Simplex can, in general, only protect
the system against properties known up front to result in
unsafe states. For example, in Stuxnet, the malicious controller
would actuate the plant motors at very high frequencies for
some periods and then at very low frequencies for others in
order to damage the motors. If the Decision Module were not
monitoring this property, such unsafe actuation would still
proceed to the plant.
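
The limitation above can be illustrated with a minimal Python sketch of
a decision rule that only rejects commands violating a property it has
been told to monitor. This is not the S3A implementation: the names
`FrequencyEnvelope` and `decide`, and the numeric limits used in the
usage note, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrequencyEnvelope:
    """Hypothetical safe range for the motor drive frequency, in Hz."""
    min_hz: float
    max_hz: float

    def is_safe(self, commanded_hz: float) -> bool:
        # A command is safe only if it lies inside the known envelope.
        return self.min_hz <= commanded_hz <= self.max_hz


def decide(complex_cmd_hz: float,
           safety_cmd_hz: float,
           envelope: Optional[FrequencyEnvelope]) -> float:
    """Pick the command that is forwarded to the plant.

    If the frequency property is monitored (envelope is not None) and the
    Complex Controller's command violates it, fall back to the Safety
    Controller's command. If the property is not monitored, the Complex
    Controller's command passes through unchecked.
    """
    if envelope is not None and not envelope.is_safe(complex_cmd_hz):
        return safety_cmd_hz
    return complex_cmd_hz
```

With an assumed envelope of 900 to 1100 Hz,
`decide(1400.0, 1000.0, FrequencyEnvelope(900.0, 1100.0))` falls back to
the safety command, whereas `decide(1400.0, 1000.0, None)` forwards the
out-of-range command because the property was never specified.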

8. Related Work 

The closest work to S3A is by Zimmer et al. [34]. They
use worst-case execution time (WCET) information to de- 
tect intrusions in hard real-time systems by instrumenting 
the tasks and schedulers and periodically checking whether 
the execution has gone past the expected WCET values. 
Our work is more focused on detecting intrusions in real- 
time control systems and ensuring that the plant remains 
safe even if the intruder is able to bypass all the detec- 
tion/security mechanisms. Also, in our work, the system 
remains safe even if the intruder gains root privileges to 
the system - the work by Zimmer et al. cannot withstand
this level of intrusion since an attacker with root privileges 
can bypass all the checking mechanisms. Also, our check- 
ing/monitoring is performed by a trusted hardware compo- 
nent that is separate from the main system thus increasing 
the overall robustness of the architecture. 



The trusted computing engine (TCE) [13] as well as the 
reliability and security engine (RSE) [14] also use secure
co-processors to execute security-critical code and to monitor
access to critical data. During setup, the security-critical
application is loaded on the TCE and then access
to it is monitored during runtime. To detect other secu- 
rity violations, compile-time analysis is performed to de- 
termine the critical data, the dataflow and what parts of 
the code are allowed to access this data. At runtime, RSE 
monitors all of this information to see if unauthorized
instructions/programs access this critical data. While these
techniques could be combined with S3A (since they are
more about intrusion prevention), S3A does not need to know
which data is critical or even touch the source code. We
detect intrusions by observing the innate characteristics of
the program at runtime.

The IBM 4758 secure co-processor could also be used
to perform intrusion detection [33]. The co-processor contains
a CPU and separate memory (volatile and non-volatile) along
with cryptographic accelerators, and comes wrapped in a
tamper-responding secure boundary. The main methods employed
for intrusion detection included checking the system
for invariants (for example, that a normal user's 'uid'
should never change to root) and detecting related violations.
The authors also used it to execute virus-checking programs
since it could not be tampered with. While we could
adapt this processor for use with our architecture, the main
difference from S3A lies in the fact that we employ the
inherent characteristics of the program to detect intrusions,
especially in the CPS domain; also, coupling with the System
Simplex mechanism increases the robustness of the overall
system.

FlexCore [8] uses a reconfigurable fabric to implement
monitoring and bookkeeping functions. It can implement
specific security methods such as array bounds checking,
uninitialized memory checks, dynamic information flow
tracking, etc. in the reconfigurable hardware. While many of
these functionalities could be implemented in S3A, the main
difference from FlexCore lies in the fact that we (a) don't
need to know what types of attacks are taking place (as long
as they modify the execution time behavior of our code) and
(b) don't need to analyze the program structure/data, as is
the case with FlexCore.

Pioneer [22] uses sophisticated checksum code and its 
execution time information to establish safe remote execu- 
tion on an untrusted computer. The checksum code is care- 
fully designed so that any malicious modification will result 
in increased execution time that will be detected by the re- 
questing computer. While their goal is remotely executing 
arbitrary code safely on untrusted computers, our goal is to 
detect the behavioral changes of known code running on
potentially compromised computers.

TVA [10] provides guarantees that the software running
on the computer is safe, in conjunction with a trusted
hardware component (TPM). While they also use trusted
hardware, our approach differs in that we're not trying to prevent
intrusions or attacks. Our aim is to detect these quickly and
maintain the physical safety of the plant.

Other related work is the use of PRET (precision timed
machines) to detect and protect against side-channel
attacks [17]. While their work is focused on preventing
attacks based on side channels, we use side channels for a
benevolent purpose - to improve the overall security of the
system.

9. Conclusions and Future Work

In this paper we presented a new framework named Se- 
cure System Simplex Architecture (S3A) that enhances the 
security and safety of a real-time control system such as 
a SCADA plant. We use a combination of trusted hard- 
ware, benevolent side-channels, OS techniques and the in- 
trinsic real-time nature (and domain-specific characteris- 
tics) of such systems to detect intrusions and prevent the 
physical plant from being damaged. We were able to detect
intrusions in the system in less than 6 μs and changes
of less than 0.6 μs - time scales that are extremely hard for
an intruder to defeat. This paper also shows that even if
an attacker is able to bypass all security/intrusion detection 
techniques, the actual plant will remain safe. Another im- 
portant characteristic of these techniques is that there are no 
modifications required in the source code. We believe that 
the novel techniques and architecture presented in this pa- 
per will significantly increase the difficulty faced by would- 
be attackers thus improving the security and overall safety 
of such systems. 

The intrusion detection capabilities of S3A can be further
enhanced by monitoring multiple side channels and/or
improving the predictability of the system. For example, with
the current implementation, the more predictable the system
is, the smaller the jitter measured by the timing analysis and
the tighter the execution time range that the Secure Simplex
can enforce.
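
The relationship between jitter and the enforceable range can be
sketched as follows. This is a simplified illustration rather than the
timing analysis used in S3A: it assumes the range is simply the minimum
and maximum of a set of profiled execution times, and the function names
and sample values are hypothetical.

```python
from typing import Sequence, Tuple


def execution_window(profile_us: Sequence[float],
                     margin_us: float = 0.0) -> Tuple[float, float]:
    """Derive an allowed execution time range (in microseconds).

    Assumption: the range spans the profiled minimum and maximum,
    widened by an optional safety margin on both sides.
    """
    if not profile_us:
        raise ValueError("profile must contain at least one sample")
    return min(profile_us) - margin_us, max(profile_us) + margin_us


def is_suspicious(observed_us: float, window: Tuple[float, float]) -> bool:
    """Flag an execution that finishes outside the allowed range."""
    lower, upper = window
    return not (lower <= observed_us <= upper)


# Two hypothetical profiles of the same task with the same nominal time:
# one measured on a system with high jitter, one on a predictable system.
jittery = execution_window([100.0, 103.0, 106.0])
predictable = execution_window([102.8, 103.0, 103.2])

# An execution that takes 1 microsecond longer than nominal hides inside
# the wide range but falls outside the tight one.
hidden = is_suspicious(104.0, jittery)
caught = is_suspicious(104.0, predictable)
```

Here `hidden` is `False` and `caught` is `True`: the same deviation is
only detectable once the jitter, and hence the enforced range, is small
enough.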

For future work we plan to investigate other side chan- 
nels. For instance, instruction count can be used in S3A so 
that a deviation in the number of instructions can be treated 
as an indication of the existence of malicious code. Fairly 
small modifications in the processor could enable trusted 
hardware to access the CPU instruction counter, thus
enabling an instruction-based side channel. Finally, a
predictable execution model like PREM [21] can also
considerably enhance system predictability and hence the
precision of the timing side channel. In fact, PREM can almost
eliminate the execution time jitter that results from bus
and memory contention.